NIXL EP: Use VMM API for device memory allocation.#1415

Merged
itayalroy merged 32 commits into ai-dynamo:main from ofirfarjun7:topic/nixl-ep-use-vmm-api
Apr 5, 2026

Conversation

@ofirfarjun7
Collaborator

@ofirfarjun7 ofirfarjun7 commented Mar 8, 2026

What?

Use VMM API for device memory allocation in nixl_ep

Why?

To support multi-node NVLink.

How?

  • Create a CUDA allocator wrapper.
  • Replace cudaMalloc calls with the new allocator.
  • Fall back to cudaMalloc if fabric handles are not supported.
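The allocate-or-fall-back flow described above can be sketched with the CUDA Driver VMM API. This is a hypothetical sketch: the names `vmm_region` and `vmm_init` mirror the PR description, but the body is illustrative, not the merged code, and it needs a CUDA-capable device to run.

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdexcept>

struct vmm_region {
    CUdeviceptr ptr = 0;
    size_t size = 0;
    CUmemGenericAllocationHandle handle = 0;
    bool is_cuda_malloc = false;
};

static vmm_region vmm_init(size_t size, CUdevice dev) {
    vmm_region r;

    // Fall back to cudaMalloc when fabric handles are unsupported on this device.
    int fabric = 0;
    if (cuDeviceGetAttribute(&fabric,
                             CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED,
                             dev) != CUDA_SUCCESS || !fabric) {
        void *p = nullptr;
        if (cudaMalloc(&p, size) != cudaSuccess)
            throw std::runtime_error("cudaMalloc fallback failed");
        r.ptr = reinterpret_cast<CUdeviceptr>(p);
        r.size = size;
        r.is_cuda_malloc = true;
        return r;
    }

    // VMM path: request a fabric-exportable physical allocation.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    r.size = ((size + gran - 1) / gran) * gran;  // round up to granularity

    // Create physical memory, reserve a VA range, map, and enable access.
    cuMemCreate(&r.handle, r.size, &prop, 0);
    cuMemAddressReserve(&r.ptr, r.size, gran, 0, 0);
    cuMemMap(r.ptr, r.size, 0, r.handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(r.ptr, r.size, &access, 1);
    return r;
}
```

Error checking on the VMM calls is elided for brevity; the matching `vmm_free` would `cuMemUnmap` / `cuMemAddressFree` / `cuMemRelease` (or `cudaFree` on the fallback path).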

Summary by CodeRabbit

  • Refactor

    • Endpoint memory management now uses virtual-memory-backed allocation throughout, replacing direct device allocations for more robust initialization, teardown, and automatic cleanup.
  • New Features

    • Added support for virtual memory / RDMA-backed device regions, enabling larger, more flexible buffers and improved interoperability with advanced device capabilities.

@ofirfarjun7 ofirfarjun7 requested review from a team, ebarilanM and itayalroy as code owners March 8, 2026 16:45
@copy-pr-bot

copy-pr-bot bot commented Mar 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions bot commented Mar 8, 2026

👋 Hi ofirfarjun7! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@ofirfarjun7 ofirfarjun7 changed the title Topic/nixl ep use vmm api NIXL EP: Use VMM API for device memory allocation. Mar 8, 2026
@coderabbitai

coderabbitai bot commented Mar 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds CUDA Driver VMM-backed allocation support: introduces vmm_region, vmm_init(size_t, CUdevice) and vmm_free(vmm_region&), replaces direct cudaMalloc/cudaFree with VMM-backed allocations in Buffer (workspace, RDMA, mask, sync, sync-count), updates init/destroy flows and memory-view integration.

Changes

Cohort / File(s) / Summary

  • Header: types & Buffer fields — examples/device/ep/csrc/nixl_ep.hpp
    Adds struct vmm_region { CUdeviceptr ptr; size_t size; CUmemGenericAllocationHandle handle; bool is_cuda_malloc; }; adds private vmm_region members to Buffer for workspace, rdma, mask, sync, and sync_count; adds <cuda.h>, <cuda_runtime.h>, and <stdexcept> includes.
  • Source: VMM helpers & allocator logic — examples/device/ep/csrc/nixl_ep.cpp
    Introduces vmm_init(size_t, CUdevice) and vmm_free(vmm_region&), an internal cuda_alloc_ctx, device capability/granularity checks, and a fallback to cudaMalloc; replaces prior cudaMalloc/cudaFree with VMM-backed allocations and stores regions in new m_*_alloc members.
  • Buffer lifecycle & memory-view integration — examples/device/ep/csrc/...
    Updates Buffer::init and Buffer::destroy to allocate/free via vmm_init/vmm_free, assign pointer fields from m_*_alloc.ptr, reset pointers to nullptr, and adapt calls to _nixl_ep_memory_views_create / _nixl_ep_memory_views_destroy to use VMM regions.
  • Includes & compilation — examples/device/ep/csrc/...
    Adds the CUDA Driver API and runtime headers required for VMM and driver calls.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Buffer (app)
    participant Driver as CUDA Driver
    participant Device as GPU/Device
    participant Fallback as cudaMalloc/runtime

    App->>Driver: vmm_init(size, device)
    Driver->>Driver: check device attributes & granularity
    alt VMM supported
        Driver->>Device: cuMemCreate / cuMemAddressReserve / cuMemMap
        Device-->>Driver: allocation handle & device pointer
        Driver-->>App: vmm_region {ptr, size, handle, is_cuda_malloc=false}
    else Fallback
        Driver->>Fallback: cudaMalloc(size)
        Fallback-->>Driver: device pointer
        Driver-->>App: vmm_region {ptr, size, handle=0, is_cuda_malloc=true}
    end
    App->>Driver: use ptr for buffers / create memory views
    App->>Driver: vmm_free(vmm_region)
    Driver->>Device: cuMemUnmap / cuMemRelease (or cudaFree if fallback)
    Driver-->>App: freed
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I swapped my mallocs for mapped terrain,
Handles hug regions, pointers train,
Granularity snug, no stray pain,
Views aligned along the lane,
Hop—VMM carrots in my brain 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The title clearly and concisely describes the main change: replacing direct CUDA allocations with the VMM API for device memory allocation in NIXL EP.
  • Description check ✅ Passed — The PR description includes all required sections (What, Why, How) and provides sufficient detail about the changes and their purpose.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/device/ep/csrc/nixl_ep.hpp`:
- Around line 66-77: The two calls to cuDeviceGetAttribute (checking
CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED and
CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED) do not check their CUresult
return values; update the code around the variables rdma_vmm_supported and
fabric_supported to capture the CUresult, test it against CUDA_SUCCESS, and on
failure throw or log a runtime_error that includes the cuGetErrorString result
and context (which attribute failed and for which device); ensure you only rely
on rdma_vmm_supported/fabric_supported after the call succeeds so you don't act
on zero-initialized values.
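The checked-query pattern this comment asks for could look like the following sketch (hypothetical helper name and error text; not the PR's code, and it requires the CUDA driver to run):

```cpp
#include <cuda.h>
#include <stdexcept>
#include <string>

// Query a device attribute and fail loudly with context on error, so callers
// never act on a zero-initialized value after a silently failed call.
static int get_device_attr_checked(CUdevice_attribute attr, CUdevice dev) {
    int value = 0;
    CUresult rc = cuDeviceGetAttribute(&value, attr, dev);
    if (rc != CUDA_SUCCESS) {
        const char *err = nullptr;
        cuGetErrorString(rc, &err);
        throw std::runtime_error("cuDeviceGetAttribute(attr " +
                                 std::to_string(attr) + ") failed for device " +
                                 std::to_string(dev) + ": " +
                                 (err ? err : "unknown error"));
    }
    return value;  // safe to use only because the call succeeded
}
```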

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8bc85385-7d61-405a-90d0-e86c5ca8956c

📥 Commits

Reviewing files that changed from the base of the PR and between 1870127 and 07674ee.

📒 Files selected for processing (2)
  • examples/device/ep/csrc/nixl_ep.cpp
  • examples/device/ep/csrc/nixl_ep.hpp


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/device/ep/csrc/nixl_ep.hpp`:
- Around line 102-110: The destructor ~cuda_allocator() currently unmaps and
releases VMM state without waiting for GPU work; fix by calling
cudaDeviceSynchronize() at the start of ~cuda_allocator() before any
cuMemUnmap/cuMemAddressFree/cuMemRelease calls so all in-flight
kernels/transfers are fenced; additionally, ensure allocator creation paths
cannot bypass that fence on exception by either making explicitly_destroy
default to false or wrapping allocator construction in the init paths
(_nixl_agent_init(), _nixl_ep_init(), or any init() that creates the allocator)
with a try/catch that calls cudaDeviceSynchronize() before rethrowing so
stack-unwound destructor runs safe; keep references to the methods destroy() and
~cuda_allocator() when making changes.
- Around line 56-64: The allocator currently queries the ambient CUDA context
via cuCtxGetDevice() which is unsafe; change cuda_allocator to accept an
explicit CUdevice (or device_id) parameter, set the context explicitly inside
the constructor using cuCtxSetCurrent() and handle errors (throw on failure),
then update Buffer::init call sites to pass the tracked device_id into
cuda_allocator so allocations are bound to the correct device regardless of
external context switches; ensure you remove the cuCtxGetDevice() usage in
cuda_allocator and propagate failures with clear runtime_error messages.
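Both suggestions could be combined along these lines (a hypothetical minimal class shape, not the PR's code; the construction body is elided):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdexcept>

class cuda_allocator {
    CUdeviceptr m_ptr = 0;
    size_t m_size = 0;
    CUmemGenericAllocationHandle m_handle = 0;

public:
    // Bind explicitly to the intended device's context instead of relying on
    // whatever context happens to be current at construction time.
    cuda_allocator(CUdevice /*dev*/, CUcontext ctx) {
        if (cuCtxSetCurrent(ctx) != CUDA_SUCCESS)
            throw std::runtime_error("cuCtxSetCurrent failed in cuda_allocator");
        // ... cuMemCreate / cuMemAddressReserve / cuMemMap as in vmm_init ...
    }

    ~cuda_allocator() {
        // Fence all in-flight kernels/transfers before touching VMM mappings,
        // so unmap/release never races with GPU work still using the buffer.
        cudaDeviceSynchronize();
        if (m_ptr) {
            cuMemUnmap(m_ptr, m_size);
            cuMemAddressFree(m_ptr, m_size);
        }
        if (m_handle) cuMemRelease(m_handle);
    }
};
```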

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d14696d0-b771-4297-bed3-a7596c6c4643

📥 Commits

Reviewing files that changed from the base of the PR and between 07674ee and 7b60106.

📒 Files selected for processing (1)
  • examples/device/ep/csrc/nixl_ep.hpp

@ofirfarjun7
Collaborator Author

/build

@ofirfarjun7 ofirfarjun7 marked this pull request as draft March 9, 2026 16:33

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 70-127: The static cuda_alloc_ctx ctx binds device-specific fields
(prop.location.id, granularity, fabric support) to whichever device was active
at first cuCtxGetDevice() call, causing wrong-device VMM allocations; change
initialization so driver/version checks remain global but device-specific
queries (cuCtxGetDevice(), cuDeviceGetAttribute(),
cuMemGetAllocationGranularity()) are performed per-call or cached per-device
(keyed by device ID) instead of in the static cuda_alloc_ctx constructor—either
remove static cuda_alloc_ctx ctx and build a ctx per vmm_init()/allocation (with
a device-ID cache), or split cuda_alloc_ctx into a static global verifier and a
per-device struct populated on each allocation using the current context.
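The per-device cache this comment proposes could be sketched as follows (hypothetical struct and function names; the global driver-version verifier is omitted, and the snippet needs the CUDA driver to run):

```cpp
#include <cuda.h>
#include <map>
#include <mutex>

// Device-specific state, cached per device ID rather than captured once in a
// static constructor bound to whichever device was current at first use.
struct device_ctx {
    int fabric_supported = 0;
    size_t granularity = 0;
};

static device_ctx get_device_ctx(CUdevice dev) {
    static std::mutex mtx;
    static std::map<CUdevice, device_ctx> cache;
    std::lock_guard<std::mutex> lock(mtx);

    auto it = cache.find(dev);
    if (it != cache.end()) return it->second;

    device_ctx ctx;
    cuDeviceGetAttribute(&ctx.fabric_supported,
                         CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    cuMemGetAllocationGranularity(&ctx.granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    cache[dev] = ctx;
    return ctx;
}
```

Each vmm_init call would then look up the context for the device it is actually allocating on, so external context switches cannot bind allocations to the wrong device.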

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 120c00f8-eacc-4630-9658-e92390364a9c

📥 Commits

Reviewing files that changed from the base of the PR and between 7b60106 and 25b2e5b.

📒 Files selected for processing (2)
  • examples/device/ep/csrc/nixl_ep.cpp
  • examples/device/ep/csrc/nixl_ep.hpp

rakhmets
rakhmets previously approved these changes Mar 30, 2026
@ofirfarjun7
Collaborator Author

/build

@brminich
Contributor

/ok to test b1cab66

@brminich
Contributor

/build

Contributor

@itayalroy itayalroy left a comment


LGTM, just a couple of questions:

  1. How was it tested? I think we should make sure it does not break non-MNNVL systems
  2. Do we have any perf numbers?
    • Do we see the expected improvement on MNNVL systems?
    • Is there a degradation on non-MNNVL systems?

@ofirfarjun7
Collaborator Author

ofirfarjun7 commented Mar 31, 2026

I tested it on DFW and on Lyris (which supports MNNVL); on Lyris I ran both single-node and multi-node. I will share perf numbers. It is hard to compare on Lyris because without VMM, multi-node over NVLink does not work at all.

@ofirfarjun7
Collaborator Author

ofirfarjun7 commented Apr 5, 2026

@itayalroy

DFW, 2 nodes, 8 proc per node, 16 experts, 256 tokens, 8 topk.
main branch
Only IB
node0
[rank 12] Dispatch bandwidth: 45.26 GB/s, avg_t=333.61 us | Combine bandwidth: 45.31 GB/s, avg_t=644.75 us
[rank 15] Dispatch bandwidth: 46.76 GB/s, avg_t=322.90 us | Combine bandwidth: 44.56 GB/s, avg_t=655.75 us
[rank 8] Dispatch bandwidth: 45.18 GB/s, avg_t=334.18 us | Combine bandwidth: 45.39 GB/s, avg_t=643.63 us
[rank 13] Dispatch bandwidth: 47.15 GB/s, avg_t=320.20 us | Combine bandwidth: 44.40 GB/s, avg_t=658.06 us
[rank 11] Dispatch bandwidth: 46.00 GB/s, avg_t=328.18 us | Combine bandwidth: 44.95 GB/s, avg_t=649.93 us
[rank 10] Dispatch bandwidth: 44.65 GB/s, avg_t=338.14 us | Combine bandwidth: 45.66 GB/s, avg_t=639.82 us
[rank 9] Dispatch bandwidth: 43.87 GB/s, avg_t=344.12 us | Combine bandwidth: 46.10 GB/s, avg_t=633.73 us
[rank 14] Dispatch bandwidth: 46.77 GB/s, avg_t=322.82 us | Combine bandwidth: 44.55 GB/s, avg_t=655.77 us

With NVLINK
node0
[rank 5] Dispatch bandwidth: 79.72 GB/s, avg_t=189.38 us | Combine bandwidth: 75.93 GB/s, avg_t=384.78 us
[rank 7] Dispatch bandwidth: 82.27 GB/s, avg_t=183.52 us | Combine bandwidth: 74.81 GB/s, avg_t=390.56 us
[rank 4] Dispatch bandwidth: 83.99 GB/s, avg_t=179.75 us | Combine bandwidth: 74.02 GB/s, avg_t=394.72 us
[rank 6] Dispatch bandwidth: 77.95 GB/s, avg_t=193.67 us | Combine bandwidth: 76.82 GB/s, avg_t=380.35 us
[rank 3] Dispatch bandwidth: 82.85 GB/s, avg_t=182.23 us | Combine bandwidth: 74.47 GB/s, avg_t=392.33 us
[rank 2] Dispatch bandwidth: 84.46 GB/s, avg_t=178.76 us | Combine bandwidth: 73.78 GB/s, avg_t=396.02 us
[rank 1] Dispatch bandwidth: 81.95 GB/s, avg_t=184.22 us | Combine bandwidth: 74.95 GB/s, avg_t=389.81 us
[rank 0] Dispatch bandwidth: 83.91 GB/s, avg_t=179.93 us | Combine bandwidth: 74.11 GB/s, avg_t=394.24 us

PR
Only IB
node0
[rank 15] Dispatch bandwidth: 46.79 GB/s, avg_t=322.64 us | Combine bandwidth: 44.52 GB/s, avg_t=656.22 us
[rank 8] Dispatch bandwidth: 44.61 GB/s, avg_t=338.46 us | Combine bandwidth: 45.66 GB/s, avg_t=639.88 us
[rank 10] Dispatch bandwidth: 44.95 GB/s, avg_t=335.89 us | Combine bandwidth: 45.46 GB/s, avg_t=642.67 us
[rank 13] Dispatch bandwidth: 46.74 GB/s, avg_t=322.98 us | Combine bandwidth: 44.55 GB/s, avg_t=655.89 us
[rank 9] Dispatch bandwidth: 44.27 GB/s, avg_t=341.03 us | Combine bandwidth: 45.86 GB/s, avg_t=637.09 us
[rank 12] Dispatch bandwidth: 45.77 GB/s, avg_t=329.82 us | Combine bandwidth: 45.07 GB/s, avg_t=648.25 us
[rank 11] Dispatch bandwidth: 45.98 GB/s, avg_t=328.38 us | Combine bandwidth: 44.95 GB/s, avg_t=650.03 us
[rank 14] Dispatch bandwidth: 46.42 GB/s, avg_t=325.25 us | Combine bandwidth: 44.69 GB/s, avg_t=653.83 us

With NVLINK
node0
[rank 4] Dispatch + combine bandwidth: 56.37 GB/s, avg_t=786.08 us, min_t=760.42 us, max_t=803.52 us
[rank 1] Dispatch bandwidth: 80.88 GB/s, avg_t=186.68 us | Combine bandwidth: 75.43 GB/s, avg_t=387.32 us
[rank 7] Dispatch bandwidth: 82.17 GB/s, avg_t=183.75 us | Combine bandwidth: 74.88 GB/s, avg_t=390.16 us
[rank 3] Dispatch bandwidth: 81.83 GB/s, avg_t=184.50 us | Combine bandwidth: 75.01 GB/s, avg_t=389.48 us
[rank 2] Dispatch bandwidth: 83.14 GB/s, avg_t=181.60 us | Combine bandwidth: 74.46 GB/s, avg_t=392.37 us
[rank 5] Dispatch bandwidth: 78.84 GB/s, avg_t=191.50 us | Combine bandwidth: 76.26 GB/s, avg_t=383.11 us
[rank 6] Dispatch bandwidth: 77.68 GB/s, avg_t=194.36 us | Combine bandwidth: 76.95 GB/s, avg_t=379.67 us
[rank 4] Dispatch bandwidth: 83.62 GB/s, avg_t=180.55 us | Combine bandwidth: 74.20 GB/s, avg_t=393.75 us
[rank 0] Dispatch bandwidth: 82.60 GB/s, avg_t=182.79 us | Combine bandwidth: 74.61 GB/s, avg_t=391.58 us


Lyris, 2 nodes, 4 proc per node, 16 experts, 128 tokens, 8 topk.
main branch
Only IB
node0
[rank 2] Dispatch bandwidth: 56.02 GB/s, avg_t=134.09 us | Combine bandwidth: 58.45 GB/s, avg_t=248.69 us
[rank 3] Dispatch bandwidth: 54.75 GB/s, avg_t=137.21 us | Combine bandwidth: 59.28 GB/s, avg_t=245.23 us
[rank 0] Dispatch bandwidth: 57.44 GB/s, avg_t=130.78 us | Combine bandwidth: 57.78 GB/s, avg_t=251.57 us
[rank 1] Dispatch bandwidth: 56.23 GB/s, avg_t=133.59 us | Combine bandwidth: 58.46 GB/s, avg_t=248.67 us

With NVLINK (multi node nvlink)
node0
Not supported

PR
Only IB
node0
[rank 2] Dispatch bandwidth: 52.92 GB/s, avg_t=141.94 us | Combine bandwidth: 58.37 GB/s, avg_t=249.04 us
[rank 0] Dispatch bandwidth: 53.66 GB/s, avg_t=140.00 us | Combine bandwidth: 56.98 GB/s, avg_t=255.14 us
[rank 1] Dispatch bandwidth: 54.53 GB/s, avg_t=137.75 us | Combine bandwidth: 56.45 GB/s, avg_t=257.51 us
[rank 3] Dispatch bandwidth: 51.98 GB/s, avg_t=144.52 us | Combine bandwidth: 58.91 GB/s, avg_t=246.75 us

With NVLINK (multi node nvlink)
node0
[rank 7] Dispatch bandwidth: 271.28 GB/s, avg_t=27.69 us | Combine bandwidth: 390.77 GB/s, avg_t=37.20 us
[rank 6] Dispatch bandwidth: 268.05 GB/s, avg_t=28.02 us | Combine bandwidth: 387.42 GB/s, avg_t=37.52 us
[rank 5] Dispatch bandwidth: 270.85 GB/s, avg_t=27.73 us | Combine bandwidth: 384.65 GB/s, avg_t=37.79 us
[rank 4] Dispatch bandwidth: 271.35 GB/s, avg_t=27.68 us | Combine bandwidth: 384.09 GB/s, avg_t=37.85 us

@itayalroy itayalroy self-requested a review April 5, 2026 15:02
@itayalroy itayalroy merged commit 06491f8 into ai-dynamo:main Apr 5, 2026
14 checks passed